Study of Chinese Text Similarity Based on Difference Factor in Word-Number

Authors

  • Yan Niu
  • Qing Zou
  • Yaqing Han
Abstract

Text similarity calculation is a fundamental task in Chinese information processing. A high-quality text similarity method must be both accurate and efficient. It should compare texts at the level of natural-language meaning and, based on a full understanding of the semantics of the author or text source, reach similarity judgments close to those of a human reader. It should also be efficient enough to save processing time when large amounts of text have to be processed. After surveying domestic and foreign literature and analyzing the current state of similarity calculation, this paper presents a new method intended to improve the performance of similarity calculation: a Chinese text similarity algorithm based on word-number difference. The algorithm combines the traditional statistics-based method with the narrow semantic method, aiming to pair statistical efficiency with semantic accuracy. Combining the advantages of the two categories also means facing and overcoming the disadvantages of both. The paper takes the difference in word-number as its starting point and exploits the diversity of Chinese word-number, combining it with word frequency, number, and meaning, in order to extend word similarity calculation to text similarity calculation. Finally, a self-built small text set is used as the test object, and the similarity results of different methods are compared in a laboratory environment. The results show that the method based on difference in word-number performs better than the traditional statistical and semantic methods. Through manual comparison of the test results in terms of accuracy and speed of segmentation, the study provides a new approach to Chinese text similarity calculation.
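For orientation, the statistics-based baseline that the abstract contrasts against can be pictured as cosine similarity over word-frequency vectors. The sketch below is an assumption about what such a baseline typically looks like, not the word-number difference algorithm of the paper. The function name `cosine_similarity` and the sample token lists are hypothetical, and the texts are assumed to be already segmented into words.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two pre-segmented texts (lists of words).

    Hypothetical baseline: each text becomes a word-frequency vector,
    and the score is the cosine of the angle between the two vectors.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already-segmented sample texts.
doc1 = ["文本", "相似度", "计算", "方法"]
doc2 = ["文本", "相似度", "算法"]
score = cosine_similarity(doc1, doc2)
```

A purely frequency-based score of this kind ignores word meaning, which is the limitation the abstract attributes to statistical methods.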

Similar resources

Keyword Extraction From Chinese Text Based On Multidimensional Weighted Features

This paper addresses the problems of incomplete coverage and low accuracy in keyword extraction from Chinese text, drawing on intrinsic features of the Chinese language and an extraction method based on weighted eigenvalues of multidimensional information. The method combines theoretical analysis and experimental calculation to study the parts of speech, word position, word length, semantic similarit...

The Research of Chinese Words Semantic Similarity Calculation with Multi-Information

Text similarity has a relatively wide range of applications in many fields, such as intelligent information retrieval, question answering systems, text rechecking, and machine translation. Meaning-based text similarity computation has been used more widely in computing the similarity of words and phrases. Using the knowledge structure of the and its method of knowledg...

The Effect of Pictorial Flashcards on the Sight Word Recognition in Kindergartens

It was a quasi-experimental study because the study involved in training participants in two classes each containing about 5 to 6 years old pre-primary students. To this end, fifty students participated in the study who were studying at Misagh School in Tabriz. In order to make sure of their homogeneity, the researcher administered a pre-test. Based on the results, 40 students were selected as the ...

A Component Histogram Map Based Text Similarity Detection Algorithm

Conventional text similarity detection usually uses word frequency vectors to represent texts, but this representation is high-dimensional and sparse. In this research, a new text similarity detection algorithm using a component histogram map (CHM-TSD) is proposed. This method is based on the mathematical expression of Chinese characters, with which Chinese characters can be split into components. Then each ...

The Dependence of Frequency Distributions on Multiple Meanings of Words, Codes and Signs

The dependence of the frequency distributions due to multiple meanings of words in a text is investigated by deleting letters. By coding the words with fewer letters the number of meanings per coded word increases. This increase is measured and used as an input in a predictive theory. For a text written in English, the word-frequency distribution is broad and fat-tailed, whereas if the words ar...

Journal title:
  • Journal of Multimedia

Volume 9  Issue 

Pages  -

Publication date 2014